Abstract: The recently discovered polar codes are seen as a
major breakthrough in coding theory; they provably achieve the
theoretical capacity of discrete memoryless channels using the
low-complexity successive cancellation (SC) decoding algorithm.
Motivated by recent developments in polar coding theory, we
propose a family of efficient hardware implementations for SC
polar decoders. We show that such decoders can be implemented
with $O(n)$ processing elements and $O(n)$ memory elements, and can
provide a constant throughput for a given target clock frequency.
Furthermore, we show that SC decoding can be implemented in
the logarithm domain, thereby eliminating costly multiplication
and division operations and greatly reducing the complexity of
each processing element. We also present a detailed architecture
for an SC decoder and provide logic synthesis results confirming
the linear growth in complexity of the decoder as the code
length increases.

Index Terms: Polar codes, successive cancellation decoding,
hardware implementation, VLSI.

I. Introduction 

Polar codes [1] form a family of error-correcting codes
with explicit and efficient construction [2], encoding,
and decoding algorithms. They achieve channel capacity,
asymptotically in the code length $n$, when the underlying
channel is memoryless and has a discrete input alphabet [3].
To date, they are the first codes to provably achieve channel
capacity with tractable decoding complexity. Moreover, in
some information-theoretic applications, such as achieving
the secrecy capacity of the wiretap channel in the general
case, polar codes are the only known solution that is both
explicit and efficient [4]. They are therefore seen as a major
breakthrough in coding and information theory.

From a practical point of view, however, polar codes come 
close to achieving the channel capacity only for very large 
code lengths, e.g. $n > 2^{20}$. Recent works have therefore started
to address the issue of performance at shorter code lengths. For 
example, it was shown in [5] that the belief propagation (BP)
decoding of polar codes improved their performance compared 
to successive cancellation (SC) decoding without an increase 
in block length n. This performance gain is however obtained 
at the expense of an increase in decoding complexity. List 
decoding [6] also improves performance without an increase
in code length; however, decoding complexity grows linearly 
in list size. 

Driven by recent theoretical advances related to polar codes
and the extra complexity incurred by the use of BP or list
decoding, we aim to find efficient hardware architectures for
SC decoding, allowing both high-throughput and low-area
implementations of moderate-length polar decoders. Starting
from the general framework proposed by Arikan [1] and
described in Section II, we develop multiple decoder
architectures in order of decreasing hardware complexity and show
that SC decoding can actually be implemented with hardware
complexity $O(n)$ using the line decoder in Section III. Finally,
we address the implementation of the decoder and its
computational nodes and present logic synthesis results confirming
our complexity analysis in Section IV.

II. Polar Codes 

A polar code is a linear block error-correcting code designed
for a specific discrete-input, memoryless channel. From here
on, we will assume that the channel has a binary input alphabet
and is symmetric as well [1]. Let $n = 2^m$ be the code length
and let $u = (u_0, u_1, \ldots, u_{n-1})$ and $c = (c_0, c_1, \ldots, c_{n-1})$
denote the input bits and the corresponding codeword$^1$,
respectively. The encoding operation has a fast-Fourier-transform-like
butterfly structure, depicted in Figure 1 for $n = 8$. Note
that the ordering of the bits in Figure 1 is according to the
bit-reversed order: if we reverse the order of the bits in the
binary representation of $i$, we then get the natural ordering.
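The butterfly structure described above can be sketched in software as follows. This is only an illustration, not the encoder of Figure 1: the function name `polar_encode` is hypothetical, and the sketch assumes natural (not bit-reversed) index ordering, so its output may differ from the figure by the bit-reversal permutation.

```python
def polar_encode(u):
    """Encode n = 2**m bits with an FFT-like XOR butterfly (natural index order)."""
    n = len(u)
    if n == 0 or n & (n - 1):
        raise ValueError("code length must be a power of two")
    c = list(u)
    step = 1
    while step < n:                          # one pass per stage, m = log2(n) passes
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                c[j] ^= c[j + step]          # upper branch takes the modulo-2 sum
        step *= 2                            # lower branch passes through unchanged
    return c


print(polar_encode([0, 0, 0, 1, 0, 0, 0, 1]))   # [0, 0, 0, 0, 1, 1, 1, 1]
```

Because the transform is its own inverse over GF(2), applying `polar_encode` twice returns the input bits, which gives a convenient sanity check.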

After $u$ is encoded into $c$, the codeword $c$ is sent over the
underlying channel (the channel is used $n$ times). Denote by
$y = (y_0, y_1, \ldots, y_{n-1})$ the corresponding channel output. We
now wish to decode $y$. This is done using a successive
cancellation decoder. That is, given $y$, we first try to deduce
the value of $u_0$, then that of $u_1$, and so forth up until $u_{n-1}$.
We do this as follows. Assume that we are currently at bit
$i$ and have already estimated the values of $u_0, u_1, \ldots, u_{i-1}$
to be $\hat{u}_0, \hat{u}_1, \ldots, \hat{u}_{i-1}$. Next, for $b \in \{0, 1\}$, denote by
$\Pr(y \mid \hat{u}_0^{i-1}, u_i = b)$ the probability that $y$ was received, given
that $u_0^{i-1} = \hat{u}_0^{i-1}$, $u_i = b$, and $u_{i+1}, u_{i+2}, \ldots, u_{n-1}$ are
independent random variables with Bernoulli distribution of
parameter 0.5. The estimated value $\hat{u}_i$ is chosen according to:

$$\hat{u}_i = \begin{cases} 0 & \text{if } \dfrac{\Pr(y \mid \hat{u}_0^{i-1}, u_i = 0)}{\Pr(y \mid \hat{u}_0^{i-1}, u_i = 1)} > 1, \\ 1 & \text{otherwise.} \end{cases} \qquad (1)$$

As the code length $n$ increases, the probability that a bit
$u_i$ is correctly decoded, given that all previous bits were
correctly decoded, approaches either 1 or 0.5, as proven
in [1]. The fraction of bits whose probability of successful
decoding approaches 1 tends towards the capacity of the
underlying channel as $n$ increases. This information regarding
bit reliabilities is used to select a high-reliability subset of $u$
to store information bits, while the rest of $u$, called the frozen
bit set, is set to a fixed value, assumed to be 0 in this work.
The frozen set is known at the decoder, which sets $\hat{u}_i$ to 0 if
$i$ is in the frozen set, and uses Equation (1) otherwise.

$^1$Note that $n$ input bits are encoded to a length-$n$ codeword. However, as
we will see later on, not all of the $n$ input bits carry information.

Fig. 1. Encoder architecture for n = 8.

III. Successive Cancellation Decoder Architectures

A. Butterfly-based architecture 

Arikan showed that SC decoding can be efficiently implemented
using the factor graph of the code, which has a structure
resembling that of the fast Fourier transform. In the remainder
of this paper, we will refer to this decoder architecture as the
"butterfly-based SC decoder." Figure 2 illustrates the graph of
this SC decoder for $n = 8$. Channel likelihood ratios (LRs)
$\lambda_i$ are assumed to be presented to the right-hand side of the
graph, whereas the estimated bits $\hat{u}_i$ appear on the opposite
end.

The SC decoder is composed of $m = \log_2 n$ stages, each
containing $n$ nodes. We refer to a specific node as $\mathcal{N}_{l,j}$, where
$l$ designates the stage index ($0 \le l \le m - 1$) and $j$ the node
index within stage $l$ ($0 \le j \le n - 1$). Each node updates its
output according to one of the two following update rules:

$$f(a, b) = \frac{1 + ab}{a + b} \quad \text{or} \quad g_{\hat{u}_s}(a, b) = a^{1 - 2\hat{u}_s}\, b. \qquad (2)$$

The values $a$ and $b$ are likelihood ratios, while $\hat{u}_s$ is a
bit that represents the partial modulo-2 sum of previously
estimated bits. For example, in node $\mathcal{N}_{1,3}$, the partial sum
is $\hat{u}_s = \hat{u}_4 \oplus \hat{u}_5$. The value of $\hat{u}_s$ determines whether function $g$
is a multiplication or a division. These update rules
are complex to implement in hardware since they involve
multiplications and divisions. In Section III-D, we propose
to perform these operations in the logarithm domain and to
apply an approximation to the function $f$.

Fig. 2. The butterfly-based SC decoder architecture for n = 8.

The sequential nature of the algorithm introduces data
dependencies into the decoding process. We notice that $\mathcal{N}_{1,2}$
cannot be updated before bit $\hat{u}_1$ is computed and, a fortiori,
not before $\hat{u}_0$ is known. In order to respect the data
dependencies, a scheduling has to be defined. Arikan proposed
two schedulings for this decoding framework [1]. In the
left-to-right scheduling, nodes recursively call their predecessors
until an updated node is reached. The recursive nature of this
scheduling is especially suitable for software implementation.
In the alternative right-to-left scheduling, any node updates
its value whenever its inputs are available, which enables
some nodes to update their values in parallel. Each bit $\hat{u}_i$ is
successively estimated by activating the spanning tree rooted at
$\mathcal{N}_{0,\pi(i)}$, where $\pi(\cdot)$ denotes the bit-reverse mapping function.
As an example, in Figure 2, the tree associated with $\hat{u}_0$ is
highlighted. If we assume that memory elements are inserted
between each stage, or equivalently that each node processor
can store its updated value, then some results can be reused.
For example, in Figure 2, bit $\hat{u}_1$ can be decoded by only
activating $\mathcal{N}_{0,4}$, since $\mathcal{N}_{1,0}$ and $\mathcal{N}_{1,4}$ have already been
updated during the decoding of $\hat{u}_0$.
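To make the recursive, software-oriented scheduling concrete, the following minimal sketch decodes in the likelihood-ratio domain using the update rules $f(a,b) = (1+ab)/(a+b)$ and $g(a,b) = a^{1-2\hat{u}_s} b$. It is an illustration under stated assumptions, not the authors' decoder: the names `sc_decode`, `f`, and `g` are hypothetical, indices follow the natural order rather than the bit-reversed order of Figure 2, and the frozen set in the usage lines is arbitrary.

```python
def f(a, b):
    """Update rule f on likelihood ratios."""
    return (1.0 + a * b) / (a + b)


def g(a, b, u_s):
    """Update rule g: a multiplication if u_s = 0, a division if u_s = 1."""
    return b * a if u_s == 0 else b / a


def sc_decode(lr, frozen):
    """Return (u_hat, x_hat): decoded bits and their re-encoded partial sums."""
    n = len(lr)
    if n == 1:
        # Hard decision on the likelihood ratio; frozen bits are forced to 0.
        u = 0 if (frozen[0] or lr[0] > 1.0) else 1
        return [u], [u]
    h = n // 2
    a, b = lr[:h], lr[h:]
    # First half of the bits: combine the two halves of the input with f.
    u1, x1 = sc_decode([f(a[k], b[k]) for k in range(h)], frozen[:h])
    # Second half: g needs the partial sums x1 produced by the first half.
    u2, x2 = sc_decode([g(a[k], b[k], x1[k]) for k in range(h)], frozen[h:])
    return u1 + u2, [x1[k] ^ x2[k] for k in range(h)] + x2


# Noiseless illustration: LR = 4 for a transmitted 0, LR = 1/4 for a transmitted 1.
lr = [4.0] * 4 + [0.25] * 4
frozen = [True, True, True, False, True, False, False, False]  # arbitrary choice
print(sc_decode(lr, frozen)[0])
```

The data dependency discussed above shows up directly in the recursion: the second recursive call cannot start until the first has returned its partial sums.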

Despite this well-defined structure and scheduling of the
butterfly-based decoder, Arikan does not address in [1] the
problem of resource sharing, memory management, or control
generation that would be required for hardware implementation.
This framework however suggests that it could be implemented
with $n \log_2 n$ combinational node processors together
with $n$ registers between each stage to store intermediate
results. In order to store the channel information, $n$ extra
registers are included as well. The total complexity of such
a decoder is

$$C_{\text{butterfly}} = (C_{np} + C_r)\, n \log_2 n + n C_r, \qquad (3)$$

where $C_{np}$ and $C_r$ are the hardware complexity of a node
processor and a memory register, respectively. In order to
decode one vector, each stage $l$ has to be activated $2^{m-l}$ times.
If we assume that one stage is activated at each clock cycle,
then the number of clock cycles required to decode one vector
is

$$N_{CC} = \sum_{l=0}^{m-1} 2^{m-l} = 2n - 2. \qquad (4)$$

The throughput in bits per second would then be

$$T = \frac{n}{N_{CC} \times t_{np}} \approx \frac{1}{2 t_{np}}, \qquad (5)$$

where $t_{np}$ is the propagation time in seconds through a node
processor, which also represents the clock period. It follows
that every node processor is actually used once every $2n - 2$
clock cycles. This motivates us to find a schedule to merge
some of the nodes into a single processing element.

$^2$This is true for almost all $i$.
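As a quick numerical check of the cycle count and throughput expressions above, the sketch below evaluates them for one code length. The function names are hypothetical, and the 2 ns clock period is only an illustrative value corresponding to a 500 MHz clock; it is not a measured node-processor delay.

```python
def clock_cycles(n):
    """Cycle count: stage l is activated 2**(m - l) times, one stage per cycle."""
    m = n.bit_length() - 1                   # m = log2(n) for n a power of two
    return sum(2 ** (m - l) for l in range(m))


def throughput_bps(n, t_np):
    """Throughput: n decoded bits per N_CC clock periods of t_np seconds."""
    return n / (clock_cycles(n) * t_np)


n = 1024
print(clock_cycles(n))                       # 2n - 2 = 2046
print(throughput_bps(n, 2e-9))               # close to 1 / (2 * t_np) = 2.5e8 bit/s
```

The throughput is essentially independent of $n$, which is the constant-throughput property claimed for a given clock frequency.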

B. Pipelined tree architecture 

Further study of the scheduling reveals that whenever
stage $l$ is activated, only $2^l$ nodes are actually updated. For
example, in Figure 2, when stage 0 is enabled, only one node
is updated. Then the $n$ nodes of stage 0 can be implemented
using a single processing element (PE). As such, for stage $l$,
$2^l$ processing elements are sufficient to update all the nodes.
However, this resource sharing does not necessarily guarantee
that the memories assigned to the merged nodes can also be
merged. Table I shows the stage activation during the decoding
of one vector $y$. When stage $S_l$ is enabled, we indicate which
function ($f$ or $g$) is applied to the $2^l$ activated nodes at stage
$S_l$ during each clock cycle (CC). Every generated variable
is used twice during the decoding. For example, the four
variables generated in stage 2 at CC #1 are consumed on CC
#2 and CC #5 in stage 1. This means that, in stage 2, the four
registers associated with the $f$ function can be reused at CC
#8 to store the four data values generated by the $g$ function.
This observation is applicable to any stage in the decoder.
The resulting proposed architecture is shown in Figure 3 for
$n = 8$. The channel LRs, $\lambda_i$, are stored in $n$ registers. The
rest of the decoder is composed of a pipelined tree structure
that includes $n - 1$ PEs, $P_{l,j}$, and $n - 1$ registers, $R_{l,j}$, with
$0 \le l \le m - 1$ and $0 \le j \le 2^l - 1$. A decision unit generates
the estimated bit $\hat{u}_i$, which is then broadcast back to every
PE. A PE is a configurable element that can perform either
the $f$ or the $g$ function. It also includes the $\hat{u}_s$ computation
block that updates the $\hat{u}_s$ value with the last decoded bit $\hat{u}_i$
only if the control bit $b_{l,j} = 1$. Another control bit is used
to select the $f$ or $g$ function. Compared to the butterfly-based
structure, the pipelined tree architecture performs the same
amount of computation with the same scheduling (see Table I)
but with a smaller number of PEs and registers. The throughput
is then the same as in (5) and the decoder has lower hardware
complexity

$$C_{\text{tree}} = (n - 1)(C_{PE} + C_r) + n C_r, \qquad (6)$$
where $C_{PE}$ represents the complexity of a single PE. In
addition to the lower complexity, one can notice that the routing
network in the decoder is much simpler in the tree architecture
than in the butterfly-based structure. Connections between
PEs are also local. This lowers the risk of congestion during
the wire routing phase of an integrated circuit design and
potentially increases the clock frequency and the throughput.

Fig. 3. Pipelined tree SC architecture for n = 8.

TABLE I
Schedule for the butterfly-based and pipelined tree SC
architectures (n = 8). A dash marks an idle stage or a cycle
with no decoded bit.

CC    1    2    3    4    5    6    7    8    9    10   11   12   13   14
S_2   f    -    -    -    -    -    -    g    -    -    -    -    -    -
S_1   -    f    -    -    g    -    -    -    f    -    -    g    -    -
S_0   -    -    f    g    -    f    g    -    -    f    g    -    f    g
û_i   -    -    û_0  û_1  -    û_2  û_3  -    -    û_4  û_5  -    û_6  û_7

C. Line architecture 

Despite the low complexity of the pipelined tree architecture,
it is possible to further reduce the number of PEs.
Looking at Table I, it appears that only one stage is activated
at a time. In the worst case, when stage $m - 1$ is activated, $n/2$ PEs
have to be used simultaneously. This means that the same
throughput can be achieved with only $n/2$ PEs. The resulting
architecture is shown in Figure 4 for $n = 8$. The processing
elements $P_j$ are arranged in a line, while the registers retain a
tree structure that is emulated by multiplexing resources connecting
the two.

For example, since $P_{2,0}$ and $P_{1,0}$ (in Figure 3) are merged
into $P_2$ (in Figure 4), $P_2$ should write either to $R_{2,0}$ or $R_{1,0}$;
and it should also read from the channel registers or from $R_{2,0}$
and $R_{2,1}$. The $\hat{u}_s$ computation block is moved out of $P_j$ and
kept close to the associated register because $\hat{u}_s$ should also be
forwarded to the PE. The overall complexity of the line SC
architecture is

$$C_{\text{line}} = (n - 1)(C_r + C_{\hat{u}_s}) + \frac{n}{2} C_{PE} + \left(\frac{n}{2} - 1\right) 3 C_{mux} + n C_r, \qquad (7)$$

where $C_{mux}$ represents the complexity of a 2-input multiplexer
and $C_{\hat{u}_s}$ is the complexity of the $\hat{u}_s$ computation block.
Despite the extra multiplexing logic required to route the data
through the PE line, the savings in number of PEs make this
SC decoder less complex than the pipelined tree architecture
while achieving the same throughput computed in (5). The
control logic is not included in the complexity estimation since
it is negligible compared to processing and memory. This will
be confirmed by logic synthesis results in Section IV-C.

The Line SC architecture can be seen as a tree architecture 
in which complexity is reduced by merging some of the PEs 
without affecting throughput. 
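The three complexity expressions can be compared numerically with a short sketch. The unit costs below are arbitrary placeholders chosen only for illustration (they are not synthesis figures from this work), the function names are hypothetical, and using the same PE cost for the tree and line cases is a simplification.

```python
def c_butterfly(n, c_np, c_r):
    """Butterfly: n*log2(n) node processors and registers, plus n channel registers."""
    m = n.bit_length() - 1                   # log2(n) for n a power of two
    return (c_np + c_r) * n * m + n * c_r


def c_tree(n, c_pe, c_r):
    """Pipelined tree: n - 1 PEs and registers, plus n channel registers."""
    return (n - 1) * (c_pe + c_r) + n * c_r


def c_line(n, c_pe, c_r, c_us, c_mux):
    """Line: n/2 PEs, n - 1 registers with partial-sum blocks, multiplexers."""
    return ((n - 1) * (c_r + c_us) + (n // 2) * c_pe
            + (n // 2 - 1) * 3 * c_mux + n * c_r)


# Placeholder unit costs (assumed): PE = node processor = 10, register = 4,
# partial-sum block = 1, 2-input multiplexer = 1.
n = 1024
print(c_butterfly(n, 10, 4))                 # 147456
print(c_tree(n, 10, 4))                      # 18418
print(c_line(n, 10, 4, 1, 1))                # 15864
```

With these placeholder costs the ordering line < tree < butterfly holds; the size of the gap between the tree and line architectures depends on how expensive a PE is relative to a multiplexer.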

D. The min-sum approximation 

SC decoding was originally proposed in the likelihood
ratio domain, in which the update rules $f$ and $g$ require
multiplications and divisions. Since the cost of implementing
these operations in hardware is very high, they are usually
avoided in practice. We thus propose to perform SC decoding
in the logarithm domain in order to reduce the complexity of
the $f$ and $g$ computation blocks. We assume that the channel
information is available as log-likelihood ratios (LLRs) $L_i$,
which leads to the following alternative representation for
equations $f$ and $g$:

$$f(L_a, L_b) = 2 \tanh^{-1}\!\left(\tanh\!\left(\frac{L_a}{2}\right) \tanh\!\left(\frac{L_b}{2}\right)\right) \quad \text{and} \quad g_{\hat{u}_s}(L_a, L_b) = L_a (-1)^{\hat{u}_s} + L_b. \qquad (8)$$

In terms of hardware implementation, $g$ can easily be
mapped to an adder-subtractor controlled by the bit $\hat{u}_s$.
However, $f$ involves some transcendental functions that are
complex to implement in hardware. One can notice that the $f$ and $g$
functions are identical to the update rules used in BP decoding
of low-density parity-check (LDPC) codes. Consequently, an
approximation used in LDPC decoder implementations [8] can
be used to approximate $f$ using the minimum function, such
that

$$f(L_a, L_b) \approx \mathrm{sign}(L_a)\, \mathrm{sign}(L_b) \min(|L_a|, |L_b|) \quad \text{and} \quad g_{\hat{u}_s}(L_a, L_b) = L_a (-1)^{\hat{u}_s} + L_b. \qquad (9)$$

In order to estimate the performance degradation incurred
by this approximation, we simulated the performance of
different polar codes on an additive white Gaussian noise (AWGN)
channel with binary phase-shift keying (BPSK). As can be
seen in Figure 5, the performance degradation is minor for
moderate code lengths and is very small (0.1 dB) for longer
codes.

Fig. 5. The min-sum approximation error-correction performance change for
PC(1024,512) and PC(16384,8192).
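The difference between the exact log-domain rule and its min-sum approximation can be seen on a single pair of LLRs. The sketch below is only an illustration with hypothetical function names; it uses floating-point arithmetic and moderate LLR values, since `math.atanh` is undefined when the product of hyperbolic tangents rounds to exactly 1 or -1.

```python
import math


def f_exact(la, lb):
    """Exact log-domain f: 2 * atanh(tanh(La / 2) * tanh(Lb / 2))."""
    return 2.0 * math.atanh(math.tanh(la / 2.0) * math.tanh(lb / 2.0))


def f_minsum(la, lb):
    """Min-sum approximation of f: product of the signs times the smaller magnitude."""
    sign = math.copysign(1.0, la) * math.copysign(1.0, lb)
    return sign * min(abs(la), abs(lb))


def g(la, lb, u_s):
    """Log-domain g: La * (-1)**u_s + Lb, i.e. an adder-subtractor."""
    return (-la if u_s else la) + lb


print(f_exact(1.5, -0.5), f_minsum(1.5, -0.5))
print(g(1.5, -0.5, 0), g(1.5, -0.5, 1))
```

For this pair both versions of $f$ agree in sign, and the min-sum magnitude is the larger of the two, i.e. the approximation is somewhat overconfident.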

IV. Line decoder hardware implementations 

Section III showed that the line architecture has a lower
hardware complexity, and is thus more efficient, than its
tree-based counterpart. This section presents details and
synthesis results of an implementation of the line architecture.

A. Fixed-point simulations 

The number of quantization bits impacts both the decoding
performance of the algorithm and the hardware complexity
of the decoder. Consequently, prior to implementing the line
decoder, a detailed analysis was carried out on a software-based
SC decoder in order to find the best tradeoff between
performance and complexity. The resulting simulations
revealed that fixed-point operations on a limited number of
quantization bits attained a decoding performance very similar
to that of a floating-point algorithm. Figure 6 illustrates this
phenomenon for a PC(1024, 512) decoder. It shows that 5 or
6 quantization bits are sufficient to reach near-floating-point
performance at a saturation level of $\pm 3\sigma$, which exhibits
good performance over all quantization levels. It should be
noted that the channel saturation level has a high impact on
the performance of low-quantization ($q = 3, 4$) decoders.
The selected saturation value ($\pm 3\sigma$) was chosen from further
software simulations not shown here.

Fig. 6. Fixed-point FER simulation for PC(1024,512) on AWGN channel.
Input saturation = $3\sigma$.



B. Line decoder detailed architecture 

1) Processing elements: The processing element is the
main arithmetic component of the line decoder. It embodies
the arithmetic logic needed to perform both the $f$ and $g$ functions
within a single logic component. This grouping, motivated by
the fact that all stages of the decoding graph perform either
function $f$ or function $g$ at any given time, allows a greater level of
resource sharing. PEs also implement the min-sum
approximation described in Section III-D, which allows for much
simpler decoding logic, as it replaces three transcendental
functions with a single comparator. Since processing elements
are replicated $n/2$ times (Equation (7)), this approximation has
a significant impact on the overall size of the decoder.

In this work, processing elements are fully combinational
and operate on quantized sign-and-magnitude (SM) coded
LLRs. We initially implemented our PEs in two's complement
(TC) format, for its wide support in HDLs, but
logic synthesis showed a 20% area reduction when using SM
instead. Indeed, Equation (9) shows that the main operations
performed on LLRs are addition, subtraction, absolute value,
sign retrieval, and minimum value, all of which are very
low-complexity operations when using SM format.

Figure 7 illustrates the overall architecture of our SM-based
PE. In this figure, $L_a$ and $L_b$ are the two $q$-bit input LLRs
of functions $f$ and $g$; a partial sum signal $\hat{u}_s$ controls the
behavior of $g$; the sign $s(\cdot)$ and the magnitude $|\cdot|$ of input
LLRs are directly extracted; and the comparator is shared for
the computation of $|L_g|$ and $s(L_g)$. Thick lines and thin
lines represent magnitude and sign data paths, respectively.
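A behavioral sketch of a sign-and-magnitude PE is given below to illustrate why this format suits the min-sum rules. It is not the circuit of Figure 7: the function name `pe`, the argument layout, and in particular the saturation of the magnitude to $2^{q-1} - 1$ on overflow are assumptions made for the illustration.

```python
def pe(sa, ma, sb, mb, u_s, select_g, q=5):
    """Sign-magnitude PE sketch: returns (sign, magnitude) of the f or g output.

    sa, sb are sign bits (1 = negative); ma, mb are (q-1)-bit magnitudes.
    """
    max_mag = (1 << (q - 1)) - 1
    a_larger = ma > mb                       # single magnitude comparison, reused below
    if not select_g:
        # Min-sum f: sign is the XOR of the input signs, magnitude is the minimum.
        return sa ^ sb, (mb if a_larger else ma)
    sa ^= u_s                                # g: u_s = 1 negates L_a
    if sa == sb:
        # Same signs: add the magnitudes and saturate (assumed overflow policy).
        return sb, min(ma + mb, max_mag)
    # Opposite signs: subtract the smaller magnitude, keep the sign of the larger.
    return (sa, ma - mb) if a_larger else (sb, mb - ma)


print(pe(0, 6, 1, 2, 0, False))              # f(+6, -2) -> (1, 2), i.e. -2
print(pe(0, 6, 1, 2, 1, True))               # g with u_s = 1: -6 + (-2) -> (1, 8), i.e. -8
```

Note that the opposite-sign case can return a zero magnitude with a set sign bit, the "negative zero" inherent to sign-magnitude coding.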

2) Register banks: As seen in Section III-C, memory
resources are needed to store partial results during the decoding
process. The decoder is implemented using two separate
memories: one for partial LLR calculations, and another for
the partial sums $\hat{u}_s$. The line decoder memory has a tree
structure and uses $(2n - 1)$ $q$-bit memory cells to store LLRs,
in addition to $(n - 1)$ 1-bit cells to store the partial sums $\hat{u}_s$
used to carry out function $g$.




Fig. 7. Processing element architecture. 



The LLR memory can be seen as $(\log_2 n + 1)$ separate
memories, one for each stage, with each stage $l$ requiring
$2^l$ $q$-bit memory cells. The $(\log_2 n + 1)$-th stage, $l = \log_2 n$, is
special in that it contains the received channel LLRs, requiring
shift-register capabilities. Each stage produces half as much data as it
consumes. This data is written into memory locations read
by the subsequent stage, $l - 1$.

The partial sum memory combines the $n \log_2 n$ partial sums
of the decoding graph into $n - 1$ memory cells by
time-multiplexing each memory cell for use by multiple nodes of
the graph.
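The LLR cell count follows from summing the per-stage sizes, as the short worked check below shows. The numerical instance ($n = 1024$, $q = 5$) is chosen only as an illustration.

```latex
\sum_{l=0}^{\log_2 n} 2^{l} \;=\; 2^{\log_2 n + 1} - 1 \;=\; 2n - 1 .
% Illustration: for n = 1024 and q = 5, the LLR memory holds
% (2n - 1) q = 2047 \cdot 5 = 10235 bits, and the partial sum memory
% holds n - 1 = 1023 bits.
```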

3) Multiplexing: The shared nature of the processing ele- 
ments used in the line architecture requires the multiplexing 
of their inputs and outputs. As shown in Figure 4, memory
is implemented using registers, and separate networks of 
multiplexers and demultiplexers are used to provide them with 
appropriate inputs from memory and store their outputs to 
memory, respectively. 

An alternate design for the line architecture could make use 
of SRAM blocks, in which case the multiplexing networks 
could be avoided completely, as equivalent logic would be 
directly embodied in the memory decoder of the SRAM 
modules. This would allow for a more compact memory block, 
although potentially increasing access time. An even more 
optimized design could mix both SRAM and registers: looking 
at Table I, it appears that some of the memory elements are
accessed more often than others. It would be more efficient 
to implement these frequently accessed memory elements into 
registers while keeping the SRAM blocks for less frequently 
accessed data. In this work, since we target moderate length 
codes, we choose to use registers only. 

4) General control: The line decoder is a multi-stage
design which sequentially decodes each codeword. It uses
specific control signals to orchestrate the decoding.

Those control signals are combinational functions of $i$, the
current decoded bit number, and $l$, the current stage. These
two signals are in turn generated using counters and some
extra logic. The underlying understanding is that up to $\log_2(n)$
stages must be activated in sequence to decode each bit $\hat{u}_i$.
Once it has been decoded, this bit is stored in specific partial
sums $\hat{u}_s$ for the decoding of subsequent bits, according to the
data dependencies highlighted previously.

$^3$Binary representations of integers are assumed to be stored in little-endian
format.

Both $i$ and $l$ can be viewed as counters, where $i$ counts up
from 0 to $n - 1$, for each decoded bit, while $l$ counts down to
0, from a value between 1 and $(\log_2(n) - 1)$, for each stage.
The decoding of a codeword takes $2n - 2$ clock cycles overall,
as demonstrated in Equation (4).

Counter $l$, unlike $i$, is not reset to a fixed value. By
making use of the partial computations stored in the LLR
memory, it can be reset to the result of a find-first-bit-set
(ffs) operation on $i$, corresponding to a modified priority
encoder. Specifically, it is reset to $\mathrm{ffs}^*(i + 1)$ upon reaching
0, according to Equation (10).

$$\mathrm{ffs}^*(x_{m-1} \ldots x_1 x_0) = \begin{cases} \min(i) : x_i = 1 & \text{if } x > 0, \\ m - 1 & \text{if } x = 0. \end{cases} \qquad (10)$$

Another control signal deals with the function that the
processing elements must perform on behalf of a specific
stage. Since the nodes of a given stage all perform the same
function at any given time, this signal can be used to control
all the PEs of the line. The function selection is performed
using Equation (11).

$$\mathrm{selector}_{f/g}(i_{m-1} \ldots i_1 i_0,\, l) = \begin{cases} f & \text{if } i_l = 0, \\ g & \text{if } i_l = 1. \end{cases} \qquad (11)$$
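The two control rules can be exercised together in a small software model that regenerates the stage schedule of Table I. This is a sketch with hypothetical names (`ffs_star`, `schedule`); it models only which stage is active and which function is selected at each clock cycle, not the actual counter hardware.

```python
def ffs_star(x, m):
    """Index of the lowest set bit of x, or m - 1 when x = 0."""
    if x == 0:
        return m - 1
    return (x & -x).bit_length() - 1


def schedule(n):
    """List (clock_cycle, stage, function, decoded_bit_or_None) for one codeword."""
    m = n.bit_length() - 1                       # m = log2(n)
    steps, cc = [], 0
    for i in range(n):
        start = ffs_star(i, m)                   # first stage to activate for bit i
        for l in range(start, -1, -1):           # stage counter counts down to 0
            cc += 1
            func = "g" if (i >> l) & 1 else "f"  # selector: bit l of i picks f or g
            steps.append((cc, l, func, i if l == 0 else None))
    return steps


for step in schedule(8):
    print(step)
```

For $n = 8$ this produces 14 entries, with stage 2 active at clock cycles 1 and 8, in agreement with the $2n - 2$ cycle count.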



5) Memory control: Both the LLR and the partial sum
memories require significant multiplexing in order to route the
proper values from both memories to the PEs, and vice versa.

The multiplexer network mapping the inputs of the processing
elements to the LLR memory uses the mapping shown in
Equation (12).

$$\mathrm{MAP}_{LLR \to PE}(l, p) = \begin{cases} \mathrm{MEM}_{LLR}(2n - 2^{l+2} + 2p) & \text{for } L_a, \\ \mathrm{MEM}_{LLR}(2n - 2^{l+2} + 2p + 1) & \text{for } L_b, \end{cases} \qquad (12)$$

where $0 \le p \le (n/2 - 1)$ is the index of the PE in the line.
This mapping assumes that the original codeword is stored in
memory $\mathrm{MEM}_{LLR}(0 : n - 1)$.

The resulting computation is then stored according to the
mapping shown in Equation (13), noting that only the first $2^l$
PEs of the line are active in stage $l$.

$$\mathrm{MAP}_{PE \to LLR}(l, p) = \mathrm{MEM}_{LLR}(2n - 2^{l+1} + p) \qquad (13)$$

Once stage 0 has been activated, the output of $\mathrm{PE}_0$ contains
the LLR of the decoded bit $i$, and a hard decision $\hat{u}_i$ can be
obtained from this soft output using Equation (1); in other
words, $\hat{u}_i = 0$ if sign(LLR) = 0. At this point, if bit $i$ is known to
be a frozen bit, the output of the decoder is forced to $\hat{u}_i = 0$.

Once bit $\hat{u}_i$ has been decoded, this value must be reflected
in the partial (modulo-2) sums $\hat{u}_s$ of the decoding graph.
Algorithm 1 determines, for each $g$ node with index $z$ in stage
$l$, whether it must be updated.

One can note that the original decoding graph contains
$\frac{n}{2} \log_2(n)$ such partial sums, but that a maximum of $n - 1$
of them are used for the decoding of any given bit. With
some careful time-multiplexing, it is thus possible to reduce
the number of memory cells used to hold the partial sums to
$n - 1$, a clear reduction in complexity. This is the approach
taken in this paper.

Algorithm 1 Partial sums updating algorithm
  $z^* \leftarrow \mathrm{bitreverse}(z)$
  if $i_l = 0$ then
    if $l = m - 1$ or $i_{(m-1):(l+1)} = z^*_{(m-2):l}$ then
      if ($l = 0$) or ((not($i_{(l-1):0}$) and $z^*_{(l-1):0}$) $= 0$) then
        Node $z$ updates its partial sum with $\hat{u}_i$
      end if
    end if
  end if

Finally, the mapping shown in Equation (14) connects the
partial sum input of $\mathrm{PE}_p$ to the partial sums memory.

$$\mathrm{MAP}_{\hat{u}_s \to PE}(l, p) = \mathrm{MEM}_{\hat{u}_s}(n - 2^{l+1} + p) \qquad (14)$$

All of these mapping equations, together with Algorithm 1,
are efficiently implemented with combinational logic.
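The address mappings can be checked for consistency in software. The sketch below assumes the mappings read as $\mathrm{MEM}_{LLR}(2n - 2^{l+2} + 2p)$ and $\mathrm{MEM}_{LLR}(2n - 2^{l+2} + 2p + 1)$ for the PE inputs, $\mathrm{MEM}_{LLR}(2n - 2^{l+1} + p)$ for the PE output, and $\mathrm{MEM}_{\hat{u}_s}(n - 2^{l+1} + p)$ for the partial sum; the function names are hypothetical.

```python
def read_addrs(n, l, p):
    """LLR memory addresses feeding L_a and L_b of PE p in stage l."""
    base = 2 * n - 2 ** (l + 2)
    return base + 2 * p, base + 2 * p + 1


def write_addr(n, l, p):
    """LLR memory address receiving the output of PE p in stage l."""
    return 2 * n - 2 ** (l + 1) + p


def partial_sum_addr(n, l, p):
    """Partial-sum memory address feeding PE p in stage l."""
    return n - 2 ** (l + 1) + p


n, m = 8, 3
for l in range(m - 1, -1, -1):
    active = range(2 ** l)                   # only the first 2**l PEs are active
    print(l,
          [read_addrs(n, l, p) for p in active],
          [write_addr(n, l, p) for p in active],
          [partial_sum_addr(n, l, p) for p in active])
```

For $n = 8$, stage 2 reads the channel LLRs at addresses 0 to 7 and writes to 8 to 11, which are exactly the addresses read by stage 1; the final output lands in cell $2n - 2$, the last of the $2n - 1$ LLR cells, and the partial-sum addresses cover cells 0 to $n - 2$.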

C. ASIC synthesis results 

In order to evaluate the silicon footprint of the line decoder, 
a generic RTL description of the architecture was designed, 
and synthesized using a standard cell library. 

This generic description enabled us to generate specific 
line decoder instances for any code length n, code rate R, 
target signal-to-noise ratio SNR, and quantization level q. 
Syntheses were carried out to measure the impact of these 
parameters on area, using Cadence RTL Compiler v9.1 and 
the TSMC 65nm worst case CMOS standard cell library.
Synthesis was driven by Physical Layout Estimators (PLE), 
which allow a more accurate estimation of interconnection 
delays and area, compared to the classical wire-load model. 
The target frequency was set to 500MHz. 

A first set of decoders was generated for $8 \le n \le 1024$
and $4 \le q \le 6$. Figure 8 shows the evolution of area as
code size and quantization increase. As expected, area grows
linearly with $n$ and $q$. The linear hardware complexity of the
line decoder validates Equation (7).

Then, a second set of decoders was generated and
synthesized for $n = 1024$, for different code rates. Synthesis
results confirmed that the code rate does not impact hardware
complexity. This was expected because the frozen bits are
stored in a ROM, whose size is constant; only its contents
change, according to the code rate and target SNR.

Finally, a set of decoders was generated for $8 \le n \le 1024$
and $q = 5$. The area of each component block was extracted,
in order to estimate their relative complexity share inside the
decoder. Results are shown in Figure 9. Memory resources
(register banks), processing logic, and multiplexing represent
38%, 36%, and 26% of the total area, respectively. The control
logic is negligible (< 1%), which is expected as it grows
logarithmically in $n$.

$^4$Nominal supply voltage and temperature are $V_{dd} = 0.9$ V and $T = 125\,^{\circ}$C,
respectively.
Fig. 8. Line decoder area for different quantization levels and code lengths
(horizontal axis: code length n, from 64 to 1024), TSMC 65nm, f = 500 MHz.

D. Verification 

Verification of the hardware design was carried out by 
means of functional simulation. Specifically, a testbench was 
written to exercise the decoder using lO'^ to 10^ randomly- 
generated noisy input vectors. The output of the simulated 
hardware decoder was then compared to its software counter- 
part, whose error-correction capabilities had previously been 
verified experimentally. This validation was repeated for var- 
ious combinations of SNR and code lengths to ensure good 
test coverage. 




Fig. 9. Line decoder area repartition for different code lengths
(horizontal axis: code length n, from 64 to 1024) and q = 6,
TSMC 65nm, f = 500 MHz.
V. Conclusion

Polar codes have recently generated great interest from
a theoretical point of view. In this paper, we explore the
hardware implementation of polar code decoders; we propose
two SC decoder architectures with linear complexity.
Software simulations allowed us to validate the proposed min-sum
approximation, and to determine implementation parameters,
such as the quantization level. For the most efficient decoder,
the line decoder, we provided a detailed description of each
component block. Logic synthesis using a standard cell library
confirmed the linear evolution of hardware complexity with
respect to the code length.



References 

[1] E. Arikan, "Channel polarization: A method for constructing
capacity-achieving codes for symmetric binary-input memoryless channels," IEEE
Trans. on Inform. Theory, vol. 55, no. 7, pp. 3051-3073, Jul. 2009.

[2] I. Tal and A. Vardy, "How to construct polar codes," submitted to IEEE
Trans. Inform. Theory, available online as arXiv:1105.6164v2,
2011.

[3] E. Sasoglu, E. Telatar, and E. Arikan, "Polarization for arbitrary discrete
memoryless channels," in Proc. IEEE Information Theory Workshop ITW
2009, 2009, pp. 144-148.

[4] H. Mahdavifar and A. Vardy, "Achieving the secrecy capacity of wiretap
channels using polar codes," in IEEE ISIT 2010, Jun. 2010, pp. 913-917.

[5] N. Hussami, R. Urbanke, and S.B. Korada, "Performance of polar codes
for channel and source coding," in IEEE ISIT 2009, Jun. 2009, pp.
1488-1492.



